Overview

This Rmarkdown documents contains the code and 2D simulations of two generic classes with linear separation. The

Packages

windowsFonts(A = windowsFont("DejaVu Sans")) 

palette("Tableau10")
library(caret)
library(kernlab)    
library(MASS)
library(dplyr)
library(knitr)
library(kableExtra)

Local sources files containing the functions

source("predict_with_uncertainty_parallel_v2.R")
source("predict_noise_confidence_interval.R")
source("classify_sample_with_limits.R")
source("plot_functions.R")

2D simulations with linear separation pattern

set.seed(1234)
n_points <- 10  # Number of points per class
n_replicates <- 5  # Number of replicates 
noise_sd <- 0.1  # Standard deviation 
rho <- 0 # Correlation coefficient

# Define the covariance matrix
cov_matrix <- matrix(c(noise_sd^2, rho * noise_sd^2, 
                       rho * noise_sd^2, noise_sd^2), nrow = 2)

# Function to generate 2D points with correlated noise
generate_class <- function(n_points, n_replicates, center_x, center_y, cov_matrix) {
  points <- matrix(0, nrow = n_points * n_replicates, ncol = 2)
  
  for (i in 1:n_points) {
    base_point <- c(center_x + rnorm(1, 0, 0.2), center_y + rnorm(1, 0, 0.2)) 
    
    for (j in 1:n_replicates) {
      noise <- mvrnorm(1, mu = c(0, 0), Sigma = cov_matrix)  # Correlated noise
      points[(i - 1) * n_replicates + j, ] <- base_point + noise 
    }
  }
  return(points)
}

class1 <- generate_class(n_points, n_replicates, center_x = 0.8, center_y = 1.6, cov_matrix)
class2 <- generate_class(n_points, n_replicates, center_x = 1.3, center_y = 1.2, cov_matrix)

data <- data.frame(
  Class = factor(rep(c("Class 1", "Class 2"), each = n_points * n_replicates)),
  Sample_ID = rep(1:(2 * n_points), each = n_replicates),
  X1 = c(class1[, 1], class2[, 1]),
  X2 = c(class1[, 2], class2[, 2])
)

# Mean center the data
data$X1 <- data$X1 - mean(data$X1)
data$X2 <- data$X2 - mean(data$X2)

# Calculate the mean and standard deviation of each sample
mean_data <- aggregate(cbind(X1, X2) ~ Class + Sample_ID, data = data, FUN = mean)
sd_data <- aggregate(cbind(X1, X2) ~ Class + Sample_ID, data = data, FUN = sd)

rm(class1, class2)

Fitting the PLS-DA model

pls_model <- train(y = data$Class,
                   x = data[,3:4], 
                   method = "pls")

Estimating the uncertainty of each prediction

The predict_with_uncertainty_parallel() function is used to estimate the uncertainty of each sample.

loc_pred <- predict_with_uncertainty_parallel(data[,-1],
                                                pls_model,
                                                n_sim = 500,
                                                return_simulations = T)

This function returns the mean and standard deviation of the \(\hat{y}\) of each sample based on the Monte Carlo simulation. It also returns the area above and bellow the decision boundary (in this case, zero) of each distribution. The classification output indicates if the sample is being classified with certainty or not. For samples where the distribution crosses the decision boundary (considering the 95% confidence level), the sample will be assigned as uncertain. Once ’ return_simulations = T’, the augmented noise will be saved. The results for this 2D simulated data is displayed in

Table 1. Results of the predict_with_uncertainty_parallel() function.
class sample y_mean y_sd area_above_0 area_below_0 classification
Class 1 1 0.4786 0.1032 1.000 0.000 Class 1
Class 1 2 0.4537 0.1195 1.000 0.000 Class 1
Class 1 3 0.0986 0.0963 0.847 0.153 Uncertain
Class 1 4 0.4058 0.0813 1.000 0.000 Class 1
Class 1 5 0.2430 0.1179 0.980 0.020 Class 1
Class 1 6 0.5741 0.1117 1.000 0.000 Class 1
Class 1 7 0.4035 0.1193 1.000 0.000 Class 1
Class 1 8 0.2370 0.1309 0.965 0.035 Uncertain
Class 1 9 0.6676 0.1423 1.000 0.000 Class 1
Class 1 10 0.1896 0.0734 0.995 0.005 Class 1
Class 2 1 -0.5636 0.1685 0.000 1.000 Class 2
Class 2 2 -0.2418 0.1437 0.046 0.954 Uncertain
Class 2 3 -0.5007 0.0723 0.000 1.000 Class 2
Class 2 4 -0.6533 0.1070 0.000 1.000 Class 2
Class 2 5 -0.2090 0.1529 0.086 0.914 Uncertain
Class 2 6 0.0334 0.2122 0.562 0.438 Uncertain
Class 2 7 -0.2577 0.1542 0.047 0.953 Uncertain
Class 2 8 -0.4661 0.1486 0.001 0.999 Class 2
Class 2 9 -0.4444 0.1509 0.002 0.998 Class 2
Class 2 10 -0.4536 0.1204 0.000 1.000 Class 2

These results can be visualized, as showed in Figure 1.

Figure 1. Uncertainty of the classification outputs for the PLS-DA model. (a) 2D data simulation. Circles represents the replicates, while the triangles represents the mean of replicates. (b) Augmented noise generated through eigen decomposition of the original covariance matrix for each sample. 50 points are showed for each sample. (c) Estimated uncertainty for each sample.

Figure 1. Uncertainty of the classification outputs for the PLS-DA model. (a) 2D data simulation. Circles represents the replicates, while the triangles represents the mean of replicates. (b) Augmented noise generated through eigen decomposition of the original covariance matrix for each sample. 50 points are showed for each sample. (c) Estimated uncertainty for each sample.

Global Uncertainty

It the model is linear and the covariances matrices are similar across all the samples in the data set, the uncertainty estimation can be estimated at once. In that case, instead of verifying if the uncertainty of a sample is distributed across the decision boundary, the uncertainty is placed on the decision boundary and it is verified if a future prediction is within the decision boundary confidence intervals. In that case, the ‘predict_noise_confidence_interval’ is used to estimate the limits of the decision boundary with 95% of confidence.

global_limits <- predict_noise_confidence_interval(pls_model,
                                                   data$Sample_ID,
                                                   n_sim = 500,
                                                   return_simulations = T)
data.frame(global_limits$lower,
           global_limits$upper)
##   global_limits.lower global_limits.upper
## 1          -0.2283833           0.2283833

Then, the classification with confidence for future samples can be calculated using the ‘classify_sample_with_limits()’ function.

global_pred <- classify_sample_with_limits(mean_data[,3:4],
                                           pls_model,
                                           global_limits$upper)
Table 2. Results of the classify_sample_with_limits() function.
class sample y_mean classification
Class 1 1 0.4732621 Class 1
Class 1 2 0.4588093 Class 1
Class 1 3 0.1024438 Uncertain
Class 1 4 0.4031910 Class 1
Class 1 5 0.2484593 Class 1
Class 1 6 0.5794556 Class 1
Class 1 7 0.4073326 Class 1
Class 1 8 0.2327395 Class 1
Class 1 9 0.6634845 Class 1
Class 1 10 0.1912483 Uncertain
Class 2 1 -0.5621914 Class 2
Class 2 2 -0.2395806 Class 2
Class 2 3 -0.4979997 Class 2
Class 2 4 -0.6519285 Class 2
Class 2 5 -0.2083974 Uncertain
Class 2 6 0.0288956 Uncertain
Class 2 7 -0.2678114 Class 2
Class 2 8 -0.4615743 Class 2
Class 2 9 -0.4522474 Class 2
Class 2 10 -0.4475907 Class 2

These results can also be visualized as showed in Figure 2.

Figure 1. Uncertainty of the classification outputs for the PLS-DA model. (a) 2D data simulation. Circles represents the replicates, while the triangles represents the mean of replicates. (b) Augmented noise generated through eigen decomposition of the original covariance matrix for all the samples. 50 points are showed for each sample. (c) Estimated uncertainty for the model.

Figure 1. Uncertainty of the classification outputs for the PLS-DA model. (a) 2D data simulation. Circles represents the replicates, while the triangles represents the mean of replicates. (b) Augmented noise generated through eigen decomposition of the original covariance matrix for all the samples. 50 points are showed for each sample. (c) Estimated uncertainty for the model.

Correlated noise

Positive

##   global_limits.lower global_limits.upper
## 1          -0.1144737           0.1144737
. .
Table 1. Results of the predict_with_uncertainty_parallel() function.
class sample y_mean y_sd area_above_0 area_below_0 classification
Class 1 1 0.5374 0.0739 1.000 0.000 Class 1
Class 1 2 0.4753 0.0890 1.000 0.000 Class 1
Class 1 3 0.1870 0.0323 1.000 0.000 Class 1
Class 1 4 0.5273 0.0459 1.000 0.000 Class 1
Class 1 5 0.2880 0.0688 1.000 0.000 Class 1
Class 1 6 0.5282 0.0766 1.000 0.000 Class 1
Class 1 7 0.4430 0.0421 1.000 0.000 Class 1
Class 1 8 0.1938 0.0402 1.000 0.000 Class 1
Class 1 9 0.6705 0.0546 1.000 0.000 Class 1
Class 1 10 0.2278 0.0479 1.000 0.000 Class 1
Class 2 1 -0.6225 0.0386 0.000 1.000 Class 2
Class 2 2 -0.1966 0.0750 0.004 0.996 Class 2
Class 2 3 -0.5918 0.0673 0.000 1.000 Class 2
Class 2 4 -0.6289 0.0318 0.000 1.000 Class 2
Class 2 5 -0.2599 0.1184 0.014 0.986 Class 2
Class 2 6 0.0024 0.0791 0.512 0.488 Uncertain
Class 2 7 -0.3258 0.0624 0.000 1.000 Class 2
Class 2 8 -0.5234 0.0700 0.000 1.000 Class 2
Class 2 9 -0.5954 0.0445 0.000 1.000 Class 2
Class 2 10 -0.3397 0.0607 0.000 1.000 Class 2
.. . .
Table 2. Results of the classify_sample_with_limits() function.
class sample y_mean classification
Class 1 1 0.5335163 Class 1
Class 1 2 0.4790602 Class 1
Class 1 3 0.1866582 Class 1
Class 1 4 0.5251782 Class 1
Class 1 5 0.2920531 Class 1
Class 1 6 0.5317339 Class 1
Class 1 7 0.4417269 Class 1
Class 1 8 0.1938483 Class 1
Class 1 9 0.6684222 Class 1
Class 1 10 0.2284636 Class 1
Class 2 1 -0.6234872 Class 2
Class 2 2 -0.1975284 Class 2
Class 2 3 -0.5898106 Class 2
Class 2 4 -0.6288601 Class 2
Class 2 5 -0.2594503 Class 2
Class 2 6 0.0034470 Uncertain
Class 2 7 -0.3277667 Class 2
Class 2 8 -0.5259511 Class 2
Class 2 9 -0.5934532 Class 2
Class 2 10 -0.3378002 Class 2
. .
Figure 1. Uncertainty of the classification outputs for the PLS-DA model. (a) 2D data simulation. Circles represents the replicates, while the triangles represents the mean of replicates. (b) Augmented noise generated through eigen decomposition of the original covariance matrix for each sample. 50 points are showed for each sample. (c) Estimated uncertainty for each sample.

Figure 1. Uncertainty of the classification outputs for the PLS-DA model. (a) 2D data simulation. Circles represents the replicates, while the triangles represents the mean of replicates. (b) Augmented noise generated through eigen decomposition of the original covariance matrix for each sample. 50 points are showed for each sample. (c) Estimated uncertainty for each sample.

Negative

##   global_limits.lower global_limits.upper
## 1          -0.3194751           0.3194751
. .
Table 1. Results of the predict_with_uncertainty_parallel() function.
class sample y_mean y_sd area_above_0 area_below_0 classification
Class 1 1 0.4991 0.1162 1.000 0.000 Class 1
Class 1 2 0.3653 0.1127 0.999 0.001 Class 1
Class 1 3 0.1121 0.1682 0.747 0.253 Uncertain
Class 1 4 0.3795 0.1657 0.989 0.011 Class 1
Class 1 5 0.2449 0.2412 0.845 0.155 Uncertain
Class 1 6 0.5110 0.1341 1.000 0.000 Class 1
Class 1 7 0.3753 0.2235 0.953 0.047 Uncertain
Class 1 8 0.2400 0.1407 0.956 0.044 Uncertain
Class 1 9 0.5769 0.1271 1.000 0.000 Class 1
Class 1 10 0.1016 0.1240 0.794 0.206 Uncertain
Class 2 1 -0.5008 0.2091 0.008 0.992 Class 2
Class 2 2 -0.2454 0.1652 0.069 0.931 Uncertain
Class 2 3 -0.4574 0.1451 0.001 0.999 Class 2
Class 2 4 -0.6509 0.1736 0.000 1.000 Class 2
Class 2 5 -0.1837 0.1799 0.154 0.846 Uncertain
Class 2 6 0.0637 0.1124 0.714 0.286 Uncertain
Class 2 7 -0.1857 0.1310 0.078 0.922 Uncertain
Class 2 8 -0.3682 0.1646 0.013 0.987 Class 2
Class 2 9 -0.3850 0.2260 0.044 0.956 Uncertain
Class 2 10 -0.5195 0.2286 0.012 0.988 Class 2
.. . .
Table 2. Results of the classify_sample_with_limits() function.
class sample y_mean classification
Class 1 1 0.4942182 Class 1
Class 1 2 0.3689006 Class 1
Class 1 3 0.1175247 Uncertain
Class 1 4 0.3718668 Class 1
Class 1 5 0.2601383 Uncertain
Class 1 6 0.5130588 Class 1
Class 1 7 0.3824708 Class 1
Class 1 8 0.2354206 Uncertain
Class 1 9 0.5731374 Class 1
Class 1 10 0.1053586 Uncertain
Class 2 1 -0.4980809 Class 2
Class 2 2 -0.2368394 Uncertain
Class 2 3 -0.4484043 Class 2
Class 2 4 -0.6492653 Class 2
Class 2 5 -0.1845602 Uncertain
Class 2 6 0.0605117 Uncertain
Class 2 7 -0.1948099 Uncertain
Class 2 8 -0.3645018 Class 2
Class 2 9 -0.3977226 Class 2
Class 2 10 -0.5084222 Class 2
. .
Figure 1. Uncertainty of the classification outputs for the PLS-DA model. (a) 2D data simulation. Circles represents the replicates, while the triangles represents the mean of replicates. (b) Augmented noise generated through eigen decomposition of the original covariance matrix for each sample. 50 points are showed for each sample. (c) Estimated uncertainty for each sample.

Figure 1. Uncertainty of the classification outputs for the PLS-DA model. (a) 2D data simulation. Circles represents the replicates, while the triangles represents the mean of replicates. (b) Augmented noise generated through eigen decomposition of the original covariance matrix for each sample. 50 points are showed for each sample. (c) Estimated uncertainty for each sample.